SHORT WRITEUP FOR SSORT.DMP[1,3], THE VARIABLE-LENGTH STRING SORT/MERGE
SSORT means String-SORT/merge. It takes an input file,
breaks it into records according to some particular delimiting
character, sorts the records into ASCII-sequential order (note,
the /L and /U switches modify this sequence), and assembles an
output file with the delimiters restored or replaced by a new
delimiter. It simulates paging, keeping data on disk (or in an
in-core cache) except while it is actually being sorted, and is
fairly efficient as long as the cache does not thrash.
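In outline, the effect (though not the mechanism) is the following,
sketched here in Python purely for illustration; paging, switches,
and duplicate-purging are all omitted:

        # Rough model of SSORT's effect: split on a delimiter,
        # sort the frames, rejoin with the output delimiter.
        def ssort(text, delim="\n", out_delim=None):
            frames = text.split(delim)
            frames.sort()          # ASCII-sequential order
            out = delim if out_delim is None else out_delim
            return out.join(frames)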
THE COMMAND TO INVOKE THE PROGRAM IS R SSORT.
THE INPUT FILE SHOULDN'T HAVE SOS LINE NUMBERS OR A TV/E DIRECTORY, BUT IT MIGHT
WORK ANYWAY. THE OUTPUT FILE WON'T OVERWRITE AN EXISTING FILE UNLESS YOU SAY YES.
A TEMPORARY FILE, CALLED SOMETHING LIKE <JOBNUM>SORT.TMP, WILL BE CREATED AND DELETED.
The file is assumed to be broken up into frames by a special delimiter
character, which may be any ASCII character you choose, except null,
which is usually ignored. There is a special feature called the "JMC"
kludge which allows two consecutive delimiters, rather than one, to act
as the frame separator, thus allowing a blank line between paragraphs
to be the frame separator if linefeed is the delimiter and
carriage-return is deleted upon input and replaced upon output.
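For instance (a Python sketch of the idea, not of the program): with
linefeed as delimiter, carriage-return deleted on input, and the JMC
kludge turned on, a blank line ends up separating the frames:

        raw = "first para\r\nstill first\r\n\r\nsecond para\r\n"
        text = raw.replace("\r", "")   # delete <cr> on input
        paras = text.split("\n\n")     # JMC: double <lf> separates frames
        paras.sort()
        out = "\n\n".join(paras).replace("\n", "\r\n")  # restore <cr>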
When you start up the program, it will type * and if you just type <crlf>
it will remind you of the command format and give you a list of switches.
The command format is <OUTPUT-FILE>←<INPUT-FILE><SWITCHES><CRLF>.
Switches are parsed from left to right; the default switches are
effectively placed ahead of whatever switches you type.
Each of the /A /O /B switches overrides any previous delimiter
specification and also resets any /D or /S to the usual values
(i.e. /D<nil> /S<delim>). The /F and /R switches merely turn on flags
in the program, the /D switch appends to the list of /D characters,
and the /S switch replaces the list of /S characters.
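The parse might be modeled like this (a hypothetical sketch; the
switch letters are SSORT's, everything else here is invented, and /A
is omitted to keep the scanner trivial):

        import re

        DEFAULTS = "/O12/D15/D14/S15&12"   # the documented defaults

        def parse(switches=""):
            delim, deleted, subst = "\n", [], None
            front = retain = False
            # Defaults are parsed first; user switches simply follow.
            for sw, arg in re.findall(r"/(.)([0-7&]*)", DEFAULTS + switches):
                if sw == "O":                 # delimiter by octal value;
                    delim = chr(int(arg, 8))  #   resets /D and /S
                    deleted, subst = [], None
                elif sw == "B":               # <lf> delim, <cr> deleted,
                    delim = "\n"              #   JMC flag (not shown here)
                    deleted, subst = ["\r"], None
                elif sw == "D":               # /D appends to the list
                    deleted.append(chr(int(arg, 8)))
                elif sw == "S":               # /S replaces the string
                    subst = "".join(chr(int(n, 8)) for n in arg.split("&"))
                elif sw == "F":
                    front = True
                elif sw == "R":
                    retain = True
            if subst is None:
                subst = delim                 # usual value: /S<delim>
            return delim, deleted, subst, front, retain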
/A<KH> and /O<OCTAL NUMBER> are used to specify the delimiter character, depending on
whether you want to use the character itself or the octal value for it
(note, for activation characters, you must use the /O form).
/B causes linefeed to be the delimiter, <cr> to be ignored but patched
back in when you are done, and the special double-occurrence flag (the
JMC feature) to be turned on. If you want the double-occurrence flag
but some other delimiter, it might be possible to do /B followed by /A
or /O, but this hasn't been tested as of 1975.APR.22 because no one
has had any use for it.
/F causes the program to assume the delimiters are at the front rather than
the usual rear-end of each frame. The major effect is at the start and
end of the file where an extra blank frame or a lost frame might occur.
This should not be done if linefeed is the delimiter because the last
record (line) will be unreadable by many text-processing programs.
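The start-of-file case is easy to see in a sketch: when each frame
carries its delimiter in front, a naive split produces a spurious
empty frame ahead of the first real one:

        >>> "∨22 APR fixed a bug∨23 APR found a new bug".split("∨")
        ['', '22 APR fixed a bug', '23 APR found a new bug']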
/R causes retention of any duplicate frames. Normally, extra copies
are purged. There is at present no way to count the duplicate frames
(as you might want when counting word frequencies in a dictionary).
There is no way to output only the duplicated frames, but a SRCCOM
between the output with /R in effect and the output without /R will
approximate this function.
/D<OCTAL NUMBER> causes that character to disappear during input. You
may still resupply it upon output by /S. Carriage-return is normally
handled this way, by SSORT (using default switches) and other SU-AI programs.
/S<OCTAL NUMBER LIST> causes that string (up to 5 characters) to be used
instead of your normal delimiter upon output.
/L (temporary kludge largely replaced by /U) causes the sorting
sequence to be modified so that A < a < B < b < C ... z < [
i.e. the lower-case characters are moved to be just greater
than their upperifications.
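The /L sequence can be modeled (a sketch, assuming plain ASCII text)
as a per-character sort key of (upper-cased character, case bit):

        def l_key(s):
            # A < a < B < b < ... < Z < z: primary order by the
            # upper-cased letter, lower-case twin sorting just after.
            return [(c.upper(), c.islower()) for c in s]

        sorted(["bat", "Ant", "ant", "Bat"], key=l_key)
        # -> ['Ant', 'ant', 'Bat', 'bat']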
/U causes lower-case characters to be treated as their upper-case
equivalents during the actual sorting, without modifying the text
in the actual records. Thus, in the absence of the /R switch,
"Aardvark" and "aardVARK" would be considered identical and exactly
one (not specified which one!) would be selected for output; with
/R they would both appear, but not in any particular order. /U is
appropriate for sorting most indexes and bibliographies, providing
that no formatting characters are in some records and not in others
to screw up the sorting sequence.
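Modeled in the same illustrative style: /U sorts on an upper-cased
copy of each record while writing out the original text, and without
/R the tied spellings collapse to a single record:

        records = ["aardVARK", "zebra", "Aardvark"]
        seen, out = set(), []
        for rec in sorted(records, key=str.upper):
            if rec.upper() not in seen:   # duplicate under /U: purged
                seen.add(rec.upper())
                out.append(rec)
        # out == ['aardVARK', 'zebra']; which tied spelling survives
        # is an accident of the method, just as in SSORT.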
The default switches for SSORT are /O12/D15/D14/S15&12 or
something close enough that you will understand it from
reading this. /O12 means linefeed is delimiter. /D15 and
/D14 mean that all carriage-returns and formfeeds are deleted
during the first input phase. Thus each record stored inside
SSORT will consist of pure text without any <cr> or <ff>
(purged) or <lf> (each linefeed is replaced with
end-of-record mark inside SSORT). /S15&12 means that upon
final output the delimiter (which was linefeed originally,
with a carriage-return in front of it most likely) is
replaced by 15&12 which is carriage-return followed by
linefeed. Thus normal SU-AI files will be put back the way
they were (except that they will end up alphabetized, and all
page-marks will be removed) and other files will be
"fixed-up" into SU-AI format.
TYPICAL USES:
A file of single-line items, such as FORTUN.TXT[2,2]
-- Use default switches and the file will be sorted
line-for-line. Indeed, SSORT was actually used on FORTUN.TXT
which is why it has been in alphabetical order since 1972.
A bibliography consisting of paragraphs of text, with
the significant key at the front of each paragraph, and a
blank line between paragraphs -- Use the /B switch. Each
paragraph will be considered an indivisible unit, and the
paragraphs will be put in correct sequence.
A bibliography like above, but where case of letters
is to be ignored -- Use /B/U.
REM's diary, consisting of items preceded by a
year-month-date field that consists of the character ∨
followed by the other info. ∨ does not occur elsewhere in the
file, so it can be used as a delimiter without any other
scanning. -- Use the switches /A∨/F, which establish ∨ as the
delimiter and assume it is at the beginning of each record.
A bibliography like described above but with the key
not the first thing in each record. -- SSORT is not equipped
to handle this. You must reformat the file, using SOS or
TECO or a specially-written program, usually written in SAIL,
so that the key is at the start of each record. Then run
SSORT. Then reformat the file back into the desired format.
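The reformat-sort-reformat dance might look like this in outline (a
sketch only; the sample records and the key-extraction rule are made
up, and the real reformatting would be done in SOS, TECO, or SAIL):

        # Records with the key ("Author year") at the END of each.
        records = ["A study of aardvarks | Jones 1970",
                   "Zebra migration | Abel 1965"]

        def key_of(rec):
            return rec.split("|")[-1].strip()  # whatever locates your key

        # Move the key to the front, sort (or hand off to SSORT), strip.
        decorated = sorted(key_of(r) + "\t" + r for r in records)
        records = [d.split("\t", 1)[1] for d in decorated]
        # -> the Abel 1965 item now precedes the Jones 1970 item.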
HOW THIS PROGRAM COMPARES WITH IBM SORT/MERGE ET AL.
SSORT is the only existing sort/merge program that can
handle arbitrarily-long strings in random format. IBM only
handles strings whose length is known a priori. IBM does
allow the key to be in any fixed location, but again I
believe the number of characters from the start of the record
to the start of the key must be a constant, which is unlike
anything we do here.
Martin Frost has a fast sorter for Balkan Folk Dance
files (lists of records in the Stanford Folk-Dancers record
collection); however, the files must be specially formatted for
the sorter to work, and I believe the maximum size is
restricted and the locations of the fields are almost fixed.
MIT TECO has a command for performing sort on the file
under edit. It allows variable-length strings and the user
can define operations to locate the key anywhere in the
string by an arbitrary set of three search commands, one to
move from the start of record to start of key, one to move
from start to end of key, and one to move from end of key to
end of record (beginning of next record). It is not protected
against the user defining contradictory (unreasonable)
searches, which it will dutifully perform until you run out of
core or call back the monitor after waiting forever while it
sits in a run loop. For more info, consult MRC, who informs
me it is efficient. Unfortunately, MIT-TECO isn't yet here
at SU-AI (as far as I know).
The method SSORT uses is a HEAPSORT on the first word
(5 characters) of each record, followed by HEAPSORTs on the
n+1st words for each batch of records that are identical up
through the nth words. If it reaches END-OF-RECORD in
matching records, then all but one copy of each identical
record are purged, except when the /R switch is enabled. The
actual data is kept on the disk, with a word of exactly 0 to
signify END-OF-RECORD. Since nulls are always purged during
the initial parse, this works.
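A Python model of the algorithm (not the actual machine code; a
"word" here is a 5-character slice, as on the PDP-10, and heapq
stands in for the hand-coded HEAPSORT):

        import heapq

        WORD = 5   # five ASCII characters per PDP-10 word

        def word(rec, n):
            # The n-th 5-character word of rec; "" past end-of-record.
            return rec[n * WORD:(n + 1) * WORD]

        def ssort_pass(records, n=0):
            # Heapsort on the n-th word, then recurse on each batch of
            # records that are still tied through word n.
            heap = [(word(r, n), i, r) for i, r in enumerate(records)]
            heapq.heapify(heap)
            ordered = [heapq.heappop(heap)[2] for _ in range(len(heap))]
            out, i = [], 0
            while i < len(ordered):
                j = i
                while j < len(ordered) and word(ordered[j], n) == word(ordered[i], n):
                    j += 1
                batch = ordered[i:j]
                if word(batch[0], n) == "" or len(batch) == 1:
                    # unique, or identical through end-of-record
                    # (duplicates purged, as without /R)
                    out.append(batch[0])
                else:
                    out.extend(ssort_pass(batch, n + 1))
                i = j
            return out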
To reduce the number of disk accesses, a virtual
memory system is simulated in core. Thus if the file is small
then almost the entire sort runs in core, although with much
software overhead. -- If the file is large but partly-sorted,
then paging is very efficient, consisting of a simulation of a
pure merge operation with buffered-mode input on each segment
that is in ascending sequence. (This cannot be done easily on
this system any other way since the number of I/O channels is
limited to '20 whereas SSORT can simulate as many
single-buffered input channels as there are pages simulated in
core, providing that the hashing of disk record numbers is
uniform in bucket space).
If the file is large and not sorted at all but
consists of a reasonably-small number of large records then
the paging is very efficient, consisting of simulating one
buffered input for each record during the decide-the-sequence
phase, and consisting of random-access input with one USETI
per record during the final output phase.
If most of the records can be distinguished by their
first five characters, then the first pass of sorting in core
does most of the job, so that almost no paging must be done.
If the file consists of short records, however, the final pass
to write out the records will be many USETIs, one per record,
and hence will be slow although linear in time with respect to
the number of records.
If the file consists of a very large number of similar
and short records, given in totally random sequence (entropic,
not sorted) then thrashing will occur and it will take a very
very long time to do the job. In that case I suggest you
break the file up into pieces that can be sorted in core, and
sort each piece by itself, then concatenate the pieces into a
large file and do one last sort on it (the last pass will be
the case of simulating a pure-merge, so it will be efficient).
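That last pass behaves as a pure merge of ascending runs, which can
be modeled like so (a sketch):

        import heapq

        # Each piece, sorted in core, is one ascending run; the final
        # pass over the concatenation is a pure merge of the runs.
        runs = [["ant", "cow"], ["bee", "dog"], ["ape", "cat"]]
        merged = list(heapq.merge(*runs))
        # -> ['ant', 'ape', 'bee', 'cat', 'cow', 'dog']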
If you find an error in this writeup, please add an addenda at
the end of this file:
ADDENDUM by DON: In the above paragraph, "addenda" should be "addendum".
ADDENDUM by RAK: DON is wrong; suppose you find two things wrong?
However "an" should be deleted.